Abstract

Determining the optimal number of communities is an important problem in detecting community structure. In this paper,
we extend a measure used to evaluate community detection algorithms so that it can find the optimal number of communities.
Based on the normalized mutual information index, which has been used as a measure of similarity between community
structures, a statistic Ω(c) is proposed to detect the optimal number of communities. In general, when Ω(c) reaches a local
maximum, especially the first one, the corresponding number of communities c is likely to be optimal in community detection.
Moreover, the statistic Ω(c) can also measure the significance of community structures in complex networks, which
has received increasing attention recently. Numerical and empirical results show that the index Ω(c) is effective in both
artificial and real-world networks.
1. Introduction

Community detection has become a very important part of research on complex networks [1, 2, 3]. Communities,
or modules, are groups of vertices with a high concentration of edges within each group and a low concentration of
edges between groups [4, 5, 6, 7]. Such communities have been observed in many different fields. For instance, community
structure is a typical feature of social networks, where some individuals are part of a tightly connected group, others
are completely isolated, and still others act as bridges between groups. Tightly connected groups of
nodes in the World Wide Web often correspond to pages on common topics, while communities in genetic networks
are related to functional modules. Consequently, finding the communities within a network is a powerful tool for
understanding the structure and function of the network and its growth mechanisms [3].

The general aim of community detection is to find meaningful divisions into groups by investigating the structural
properties of the whole graph. There are two major aspects to this problem. One concerns proposing effective
algorithms for detecting communities; the other concerns the significance or robustness of the obtained
divisions. For the first aspect, many efficient heuristic methods have been proposed over the years to detect
the communities in networks, in particular those based on spectral methods, divisive algorithms, modularity-based
methods, dynamic algorithms, and many others. However, most existing algorithms cannot obtain the optimal
number of communities directly: the communities are just the final product of the algorithm, and the number of
communities may differ from run to run. The earlier algorithms that can obtain an "optimal" number of communities
depend on the modularity Q; they include greedy techniques, simulated annealing, extremal optimization, and spectral
optimization. By assumption, high values of modularity indicate good partitions, so the partition with the maximum
modularity on a given graph should be the best, or at least a good one. This is the main motivation for modularity
maximization, perhaps the most popular class of methods for detecting communities in graphs. However, exhaustive
optimization of Q is infeasible: modularity optimization has been proved to be an NP-complete problem [9], so it
is probably impossible to find the solution in a time that grows polynomially with the size of the graph. Moreover, Santo
et al. have strictly proved that even if the modularity Q attains its maximum value, the corresponding community
number may not be the optimal one [10]. Most importantly, small changes in the number of communities may have
a huge impact on the results of detecting the community structure. Therefore, it is essential to look for simple and
effective methods of detecting the optimal number of communities.

1 yanqing.hu.sc@gmail.com
2 yfan@bnu.edu.cn

Preprint submitted to Physica A, January 18, 2013

When a method is run several times under the same conditions, it may give different community structures
because of the random factors in the algorithm. Evaluating the quality of a partition is therefore also important in
community identification. Newman described a method to calculate the sensitivity of algorithms [11]. Danon et al.
proposed a measure based on information theory [12]. These two measures mainly focus on the proportion
of nodes that are correctly grouped. Fan et al. investigated the accuracy and precision of several algorithms [13].
Accuracy means the consistency between the community structure found by an algorithm and the presumed
communities, and precision is the consistency among the community structures obtained from different runs of an
algorithm on the same network. They proposed a similarity function S to measure the difference between partitions.
The function S integrates information about the proportion of nodes that co-appear in pairs of groups from partitions
A and B and about the total number of communities. Obviously, an "ideal" community detection method should have
both high accuracy and high precision. In this paper, we propose a method for evaluating the optimal number of
communities that is based on measuring the precision of algorithms and is closely related to their accuracy. First, we
use the algorithm based on the mixture model proposed by Newman and Leicht [14] to induce a sequence P(k) (1 ≤ k ≤ n)
of divisions into communities. Second, we measure the precision of the algorithm based on "information entropy", which
has been used to evaluate the similarity of communities [15, 16, 17, 18]. Finally, we use our proposed index Ω(c) to
find the optimal number of communities. Our statistic is an auxiliary method that can be applied to almost any
algorithm with random characteristics to help it find the optimal number of communities. The method is relatively
simple to apply. It does not increase the complexity of the algorithm, since it only requires repeating the algorithm
several times. We should also mention that our method does not need to know the "real" community structure in advance.

This paper is organized as follows. The method is described in detail in Section 2. In Section 3, we apply the
method to several artificial and real networks and discuss the results. Section 4 gives our conclusions.



2. Method 

First, for a given number of groups c, we use the method based on the mixture model to divide a network n times,
obtaining n divisions of the same network into communities (in this paper, n = 50). Because of the random factors in
the algorithm, repeated runs under the same conditions may give different community structures. These divisions
therefore generally differ, meaning that the communities in different divisions may include different nodes and edges.

Second, we measure the precision of the algorithm by comparing the similarity between the divisions. A
number of indices for measuring similarities or differences between partitions of a network have been proposed in
the past. Our method follows the information-theoretic approach. As described by Danon et al. [12], a confusion
matrix N is defined, where the rows correspond to the "real" communities in the network
and the columns correspond to the "found" communities. The element N_ij is the number of nodes in the real
community i that appear in the found community j. A measure of similarity between the partitions A and B is then

$$R(c) = I(A, B) = \frac{-2 \sum_{i=1}^{c_A} \sum_{j=1}^{c_B} N_{ij} \log\left(\frac{N_{ij} N}{N_{i.} N_{.j}}\right)}{\sum_{i=1}^{c_A} N_{i.} \log\left(\frac{N_{i.}}{N}\right) + \sum_{j=1}^{c_B} N_{.j} \log\left(\frac{N_{.j}}{N}\right)} \qquad (1)$$

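where A is the "real" community structure of the network, B is the division found by the algorithm, and c_A and c_B
denote the numbers of "real" and "found" communities, respectively. N_i. is the sum over row i of the confusion
matrix, N_.j is the sum over column j, and N is the total number of nodes. I(A, B) measures the accuracy of the
algorithm: the larger I(A, B) is, the more consistent the community structure found by the algorithm is with the "real" one.
If we assume c_A = c_B = c, then I(A, B) simplifies to

$$I_c(a, b) = \frac{-2 \sum_{i=1}^{c} \sum_{j=1}^{c} N_{ij} \log\left(\frac{N_{ij} N}{N_{i.} N_{.j}}\right)}{\sum_{i=1}^{c} N_{i.} \log\left(\frac{N_{i.}}{N}\right) + \sum_{j=1}^{c} N_{.j} \log\left(\frac{N_{.j}}{N}\right)} \qquad (2)$$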
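
As a small worked instance of Eq. (1), consider a toy case that is purely illustrative and not taken from the experiments in this paper: N = 6 nodes, with A = {{1, 2, 3}, {4, 5, 6}} and B = {{1, 2}, {3, 4, 5, 6}}. Natural logarithms are assumed here; the base cancels in the ratio. The empty cell N_21 = 0 contributes nothing, under the usual convention 0 log 0 = 0.

```latex
N = \begin{pmatrix} 2 & 1 \\ 0 & 3 \end{pmatrix}, \quad
N_{1.} = N_{2.} = 3, \quad N_{.1} = 2, \quad N_{.2} = 4

-2 \sum_{i,j} N_{ij} \log\frac{N_{ij} N}{N_{i.} N_{.j}}
  = -2 \left[ 2 \log 2 + \log\tfrac{1}{2} + 3 \log\tfrac{3}{2} \right] \approx -3.819

\sum_{i} N_{i.} \log\frac{N_{i.}}{N} + \sum_{j} N_{.j} \log\frac{N_{.j}}{N}
  = 6 \log\tfrac{1}{2} + 2 \log\tfrac{1}{3} + 4 \log\tfrac{2}{3} \approx -7.978

I(A, B) \approx \frac{-3.819}{-7.978} \approx 0.479
```

Two identical partitions give I(A, B) = 1, so a value near 0.48 indicates partial agreement between A and B.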
Figure 1: Comparison of Ω(c) (solid lines) and R(c) (dashed lines) in ad hoc networks with N = 128 nodes and 4 communities of 32
nodes each. The x axis denotes the number of communities. The plot shows the results for three different values of ⟨k_inter⟩: 2, 4, and 6.
Both Ω(c) and R(c) reach their first maximum when the number of communities is 4. The results are averaged over 100 runs.



Both A and B in Eq. (2) are divisions obtained from different runs of an algorithm on the same network, so I_c(a, b)
measures the precision of the algorithm. We use Eq. (2) to compare every pair of different divisions of the same
network into communities. For each given c and n, we thus obtain n(n - 1)/2 values of I_c(a, b).
We then propose the index Ω(c), defined as

$$\Omega(c) = \frac{\sum_{a=1, b=1, a<b}^{n} I_c(a, b)}{n(n-1)/2} = \frac{2}{n(n-1)} \sum_{a=1, b=1, a<b}^{n} I_c(a, b) \qquad (3)$$

Ω(c) is the average similarity over all pairs of divisions, so it also measures the precision of the algorithm.
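
The pairwise comparison and averaging described above can be sketched in Python. This is a minimal illustration, assuming each division is stored as a list of community labels with one entry per node. The names `nmi`, `omega`, and `omega_curve` are hypothetical, and `detect(c, rng)` is a placeholder for one randomized run of whatever community detection algorithm is used; none of these names come from the original method description.

```python
import math
import random
from collections import Counter
from itertools import combinations


def nmi(labels_a, labels_b):
    """Normalized mutual information I_c(a, b) of two label lists (Eq. 2)."""
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))  # confusion matrix entries N_ij
    rows = Counter(labels_a)                  # row sums N_i.
    cols = Counter(labels_b)                  # column sums N_.j
    num = -2.0 * sum(
        nij * math.log(nij * n / (rows[i] * cols[j]))
        for (i, j), nij in joint.items()
    )
    den = sum(v * math.log(v / n) for v in rows.values())
    den += sum(v * math.log(v / n) for v in cols.values())
    # Assumption: two single-group divisions (den == 0) count as identical.
    return 1.0 if den == 0.0 else num / den


def omega(divisions):
    """Average of I_c(a, b) over all n(n-1)/2 pairs of divisions (Eq. 3)."""
    n = len(divisions)
    if n < 2:
        raise ValueError("need at least two divisions")
    total = sum(nmi(a, b) for a, b in combinations(divisions, 2))
    return 2.0 * total / (n * (n - 1))


def omega_curve(detect, c_values, runs=50, seed=0):
    """Omega(c) for each c, using `runs` randomized divisions per c.

    `detect(c, rng)` is a hypothetical callable that returns one division
    of the network into c groups as a list of labels.
    """
    rng = random.Random(seed)
    return {c: omega([detect(c, rng) for _ in range(runs)]) for c in c_values}
```

For example, `omega([[0, 0, 1, 1], [1, 1, 0, 0], [0, 1, 0, 1]])` averages one pair of divisions that are identical up to relabeling with two pairs of independent divisions, which gives 1/3.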

To show the effectiveness of our index Ω(c), we compare it with R(c) in a variety of networks whose community
structures are known. In general, the number of communities is much smaller than the number of nodes in a network.
In this regime, we find that Ω(c) performs as well as
R(c) in measuring the similarities between community structures. We also find that when Ω(c) has a local maximum,
especially the first one, the corresponding c is likely to be the optimal number of groups.
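
The selection rule above can be sketched as a scan for the first local maximum of Ω(c). This is an illustrative sketch only: the function name `first_local_maximum` is hypothetical, and the treatment of the smallest tested c, which has no left neighbour, is an assumption exposed as the `allow_left_endpoint` parameter.

```python
def first_local_maximum(omega_by_c, allow_left_endpoint=True):
    """Return the smallest c at which Omega(c) has a local maximum, or None.

    `omega_by_c` maps each tested number of communities c to Omega(c).
    An interior value counts as a local maximum when it is strictly larger
    than both neighbours. The largest tested c is never reported, because
    Omega(c) rises towards 1 as c approaches the number of nodes.
    """
    cs = sorted(omega_by_c)
    for k in range(len(cs) - 1):
        here = omega_by_c[cs[k]]
        if k == 0:
            # Assumption: the smallest tested c has no left neighbour, so it
            # qualifies if it merely exceeds the next value.
            above_left = allow_left_endpoint
        else:
            above_left = here > omega_by_c[cs[k - 1]]
        if above_left and here > omega_by_c[cs[k + 1]]:
            return cs[k]
    return None
```

With made-up values such as `{2: 0.6, 3: 0.8, 4: 0.95, 5: 0.7, 6: 0.75}` the function returns 4, and for a curve that only increases it returns `None`, which corresponds to the case where no optimal number of communities is found.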



3. Results 

To investigate the performance of our index Ω(c), we compare Ω(c) and R(c) in ad hoc networks and in
some real networks that have "known" community structures. To further measure the performance of Ω(c), we
apply it to several artificial hierarchical networks whose community structures are not known in advance, and we
compare it with the modularity function Q in ER random networks. Finally, we use Ω(c) to measure the significance of
community structure.

3.1 Results on binary ad hoc networks

As a first test, we applied Ω(c) to computer-generated random graphs with a well-known predetermined community
structure. Each graph has N = 128 nodes divided into 4 communities of 32 nodes each. Edges between
two nodes are introduced with different probabilities depending on whether the two nodes belong to the same group
or not: every node has on average ⟨k_intra⟩ links to its neighbors in the same community and ⟨k_inter⟩ links to the rest
of the network, with ⟨k_intra⟩ + ⟨k_inter⟩ = 16. Fig. 1 shows the performance of Ω(c) and R(c) in several binary ad
hoc networks. The curves correspond to different choices of ⟨k_intra⟩ and hence of ⟨k_inter⟩ = 16 - ⟨k_intra⟩. As Fig. 1
shows, Ω(c) performs as well as R(c), and when Ω(c) reaches its first maximum, the corresponding number of groups
c = 4 is the optimal number of communities.
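
A minimal Python sketch of such a benchmark graph follows. The text does not state how the link probabilities are set, so the choices p_in = ⟨k_intra⟩/31 and p_out = ⟨k_inter⟩/96 are an assumption made here to reproduce the stated average degrees; the function name `ad_hoc_network` and its parameters are likewise hypothetical.

```python
import random
from itertools import combinations


def ad_hoc_network(k_intra, k_total=16, groups=4, group_size=32, seed=None):
    """Return (edges, labels) for one planted-partition graph.

    Assumption: link probabilities are chosen so that each node has on
    average k_intra neighbours inside its group and k_total - k_intra
    neighbours outside it.
    """
    rng = random.Random(seed)
    n = groups * group_size
    labels = [node // group_size for node in range(n)]
    p_in = k_intra / (group_size - 1)          # 31 possible partners inside
    p_out = (k_total - k_intra) / (n - group_size)  # 96 possible partners outside
    edges = []
    for u, v in combinations(range(n), 2):
        p = p_in if labels[u] == labels[v] else p_out
        if rng.random() < p:
            edges.append((u, v))
    return edges, labels
```

Calling `ad_hoc_network(12, seed=1)` yields a 128-node graph whose planted labels can serve as the "real" partition A when computing R(c).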

3.2 Results on Zachary's karate club network 

The Zachary karate club network has been used as a simple test case for community detection methods
[22, 23, 24, 25, 26, 27]. This network was constructed from data collected by observing 34 members of a karate
club over a period of 2 years and considering the friendships between members. Many previous algorithms have found
that the best partition of this network has two communities [7, 14, 22]. We also apply Ω(c) and R(c) to the
Zachary karate club network. As shown in Fig. 2, Ω(c) reaches its first maximum at c = 2. According to our
method, c = 2 is the optimal number of communities in the Zachary karate club network, which agrees with the
results given in [7, 14].

Figure 2: Comparison of Ω(c) (squares) and R(c) (diamonds) in Zachary's karate club network. Both
reach their first maximum when the number of communities is 2. The results are averaged over 100 runs.

3.3 Results on an LFR Benchmark 

As remarked above, all vertices of the ad hoc networks have approximately the same degree and all communities
have exactly the same size by construction. These two features are at odds with what is observed in graph
representations of real systems. Degree distributions of real networks are usually skewed, with many low-degree
vertices coexisting with a few high-degree vertices. Lancichinetti, Fortunato, and Radicchi argued that
a good benchmark should not assume that all communities have the same size: the distribution of community
sizes in real networks is also broad, with a tail that can be fairly well approximated by a power law. They introduced
a class of benchmark graphs that account for the heterogeneity in the distributions of both degree and community
size [28, 29]. Such a benchmark is a more faithful approximation of real-world networks with community structure than
simpler benchmarks such as that of Girvan and Newman. Using the method in [28], we generate a class of LFR
benchmark networks and apply the index Ω(c) to one of them. The result is shown in Fig. 3. As the figure shows,
Ω(c) reaches its first maximum at c = 10, which matches the number of communities in the network we
constructed. In fact, Ω(c) depends on the number of communities. When c is large enough, Ω(c) is
bound to increase with c, and when c equals the number of nodes, Ω(c) is one.
Our statistic is therefore generally effective on networks whose number of communities is much smaller than the
number of vertices, which is the case for most real networks.

3.4 Results on hierarchical networks 

Hierarchical networks have been discussed extensively in the literature. Sales-Pardo et
al. proposed a method to construct hierarchical nested random graphs [34]. In this paper, we test our method on
hierarchical artificial networks with two levels. Taking Fig. 4(a) as an example, we explain how we construct the
hierarchical networks. We first create a network with 80 nodes that at the first level has two modules comprising 40
nodes each. Once nodes have been assigned to groups, we draw an edge between a pair of nodes (i, j) with probability
(1) p_2 = d_2/c_2, where d_2 is the average degree of the module at the second level and c_2 is the number of nodes
in that module, if i and j are in the same module at the second level; (2) p_1 = d_1/c_1, where d_1 is the average degree of
the module at the first level and c_1 is the number of nodes in that module, if i and j are in the same module at the first
level; (3) p_0 = d_0/c_0, where d_0 is the average degree at the top level and c_0 is the number of nodes
in the whole network, otherwise. We impose p_2 > p_1 > p_0 (here p_2 = 0.85, p_1 = 0.25, p_0 = 0.01), so the
resulting network has a larger density of connections between nodes grouped in the same submodule at the second
level, a smaller density of connections between nodes grouped in the same module at the first level, and an
even smaller density of connections between nodes grouped in separate modules at the top level.

Figure 3: Comparison of Ω(c) (squares) and R(c) (diamonds) in one of the LFR benchmark networks, which has a skewed
degree distribution, similar to real networks. The number of nodes is N = 500. Each node is given a degree taken from a power-law
distribution with exponent γ = 3. The average node degree is ⟨k⟩ = 20. Each node shares a fraction 1 - μ of its links with the other
nodes of its community and a fraction μ with the other nodes of the network; μ = 0.2 is the mixing parameter. The sizes of the
communities are taken from a power-law distribution with exponent β = 2. The resulting network has c = 10 communities. Both Ω(c)
and R(c) reach their first local maximum when the number of communities is 10.

Thus, the network
has by construction an artificial hierarchical organization. Fig. 4 shows the results of Ω(c) in two networks of this
kind. In Fig. 4, Ω(c) has several local maxima, and the numbers of communities at the first two maxima of Ω(c)
correspond well to the numbers of communities at the different levels of the hierarchical networks. However, Ω(c)
cannot always detect the "optimal" number of communities of hierarchical networks; it does so only with a certain
probability. For example, we have calculated these probabilities for the two hierarchical networks above. For the
80-node network, the probabilities of obtaining a maximum at c = 2 and c = 4 are 0.91 and 0.32, respectively.
For the 180-node network, the probabilities of obtaining a maximum at c = 3 and c = 9 are 0.85 and 0.21,
respectively. Moreover, as shown in Fig. 4(b), Ω(2) > Ω(4), and as shown in Fig. 4(d), Ω(3) > Ω(9). This is a
very interesting phenomenon: it means that large community structures are more likely to be
identified by the algorithms.
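
The two-level construction can be sketched in Python as follows. The submodule size of 20 is an assumption inferred from the reported maxima at c = 2 and c = 4 for the 80-node network, the nodes are assumed to be numbered so that each module and submodule is a contiguous block, and the names `link_probability` and `hierarchical_network` are hypothetical.

```python
import random
from itertools import combinations


def link_probability(i, j, module_size, submodule_size, p2, p1, p0):
    """Probability of an edge between nodes i and j in a two-level hierarchy.

    Assumes module_size is a multiple of submodule_size, so that every
    submodule lies entirely inside one module.
    """
    if i // submodule_size == j // submodule_size:
        return p2  # same submodule (second level)
    if i // module_size == j // module_size:
        return p1  # same module (first level), different submodules
    return p0      # different modules


def hierarchical_network(n=80, module_size=40, submodule_size=20,
                         p2=0.85, p1=0.25, p0=0.01, seed=None):
    """Return the edge list of one random two-level hierarchical graph."""
    rng = random.Random(seed)
    return [
        (i, j)
        for i, j in combinations(range(n), 2)
        if rng.random() < link_probability(
            i, j, module_size, submodule_size, p2, p1, p0)
    ]
```

Under the same assumption, the 180-node case would use `n=180, module_size=60, submodule_size=20`, which gives 3 modules and 9 submodules.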

3.5 Results on an ER random network 

Refs. [8, 37] show that high values of the modularity of Newman and Girvan do not necessarily indicate that a
graph has a definite cluster structure. In particular, partitions of random graphs may also achieve considerably
large values of Q, although we do not expect them to have community structure, because of the lack of correlations
between the linking probabilities of the vertices. We compare the index Ω(c) with the modularity function Q in an
ER network with 128 nodes and ⟨k⟩ = 1.5. Such a network is normally considered to have no definite community
structure. We use the extremal optimization algorithm [25] to detect the community structure of the network. It is
divided into 14 communities and the maximal Q is 0.6056, which is large enough to suggest that the network has a definite
community structure. This means that the modularity function Q does not perform well on networks without definite
community structure. We then apply Ω(c) to the network. As shown in Fig. 5, the statistic Ω(c) has no local maximum.
According to our method, the optimal number of communities cannot be found, meaning that the network
has no definite community structure. Thus Ω(c) may shed light on evaluating whether a network has a definite
community structure or not.
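
For reference, the sketch below generates an ER graph with the stated size and mean degree and evaluates the Newman-Girvan modularity of a given partition, using the standard form Q = Σ_s [l_s/m - (d_s/2m)^2], where l_s is the number of edges inside community s, d_s is the total degree of its nodes, and m is the number of edges. The function names are hypothetical, and the sketch does not implement extremal optimization, so it is not expected to reproduce the value 0.6056 reported above.

```python
import random
from collections import Counter
from itertools import combinations


def er_network(n=128, mean_degree=1.5, seed=None):
    """G(n, p) random graph with p set so the expected degree is mean_degree."""
    rng = random.Random(seed)
    p = mean_degree / (n - 1)
    edges = [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]
    return edges, p


def modularity(edges, labels):
    """Newman-Girvan modularity Q of the partition `labels` (one per node)."""
    m = len(edges)
    if m == 0:
        raise ValueError("modularity is undefined for a graph with no edges")
    internal = Counter()  # l_s: edges with both ends in community s
    degree = Counter()    # d_s: total degree of the nodes in community s
    for u, v in edges:
        degree[labels[u]] += 1
        degree[labels[v]] += 1
        if labels[u] == labels[v]:
            internal[labels[u]] += 1
    return sum(internal[s] / m - (degree[s] / (2 * m)) ** 2 for s in degree)
```

As a check on the formula, two triangles joined by a single edge and split into the two triangles have Q = 2 (3/7 - 1/4) = 5/14.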

3.6 Measuring the significance of community structures 

Many efficient methods have been proposed for finding communities, but few of them can definitely evaluate whether
the communities found are significant or trivial [16]. In some works, the concept of significance has been related to
the robustness or stability of a partition against random perturbations of the graph structure. The basic idea is that if a
partition is significant, it will be recovered even if the structure of the graph is modified, as long as the modification
is not too extensive. If instead a partition is not significant, one expects that minimal modifications of the graph will
suffice to disrupt the partition, so that other clusterings are recovered [8]. We apply the statistic Ω(c) to evaluate the
significance of communities in binary ad hoc networks. As described in Section 3.1, each of these computer-generated
networks has N = 128 nodes divided into 4 communities of 32 nodes each, so the optimal number of communities is
known to be 4. Setting c = 4, we use Ω(4) and R(4) to measure the significance of the community structure of these
networks. As ⟨k_inter⟩ increases (equivalently, as ⟨k_intra⟩ = 16 - ⟨k_inter⟩ decreases), the community structure of the
network becomes less and less significant. As shown in Fig. 6, both Ω(4) and R(4) decrease as ⟨k_inter⟩ increases,
which indicates that Ω(c) performs well in measuring the significance of community structures in complex networks.

Figure 4: The performance of Ω(c) in computer-generated hierarchical networks. In (a) and (c), both the x axis and the y axis run over
the n nodes of the network. In (b) and (d), the x axis shows the number of communities. The plots show that our method also helps to
detect community structures at different levels in artificial hierarchical networks. (a) n = 80, p_0 = 0.002, p_1 = 0.1, p_2 = 0.85;
(c) n = 180, p_0 = 0.005, p_1 = 0.2, p_2 = 0.8.

Figure 5: The performance of Ω(c) in an ER random network with N = 128 nodes and ⟨k⟩ = 1.5. Such a random network is normally
considered to have no definite community structure. However, the maximum of the modularity function Q found by the extremal
optimization algorithm is 0.6056, which is large enough to suggest that the network has a definite community structure. Ω(c) has no
local maximum, which means that the network has no definite community structure, in agreement with the actual situation. The results
are averaged over 100 runs.

Figure 6: The performance of Ω(c) (blue squares) and R(c) (red diamonds) in measuring the significance of
community structures in computer-generated networks. Both Ω(c) and R(c) decrease as ⟨k_inter⟩ increases, because the community
structure becomes fuzzier and fuzzier. The x axis shows ⟨k_inter⟩. The results are averaged over 100 runs.

4. Conclusion and Discussion 

Determining the optimal number of communities in a network is an important and difficult issue in the
study of complex networks. In this paper, we present a method to detect the optimal number of communities in
complex networks based on information-theoretic ideas. We apply the index Ω(c) to a range of networks, including
artificial networks and real networks with well-known community structures. The results show that when the number
of communities is much smaller than the number of nodes, the index is effective on ordinary networks, LFR benchmark
networks, and networks with hierarchical community structure. For hierarchical networks, Ω(c) detects the "optimal"
number of communities only with a certain probability, and we have found a very interesting
phenomenon: large community structures are more likely to be identified by the algorithms. This phenomenon
will be considered further in our future work. Moreover, the index Ω(c) can be used to measure the significance
of community structure, which has received much attention recently. The statistic can work with nearly every
community detection algorithm that has a random component, and it does not change the complexity of the
algorithm.